MovieMagic, a movie chain in the southwest region, is considering ways to increase spending on concessions. It has collected information on 2,000 of its customers, some of whom are part of its loyalty program and some of whom are not. It has information on the following 8 variables, which it plans to use as predictors. It plans to use amount_spent (i.e., the amount spent on concessions) as the outcome variable, since it has learned from observation that much of its profit is derived from concession sales.
amount_spent = amount spent on concessions
Their objective is to find out what factors can increase concession sales and how they can improve their prediction of the outcome variable so that they can plan better.
Along with amount_spent, MovieMagic was also able to collect information from about 150 of its existing customers in the form of reviews. They feel this text data can provide a different insight into what customers think and feel about MovieMagic.
They realize that their objective has two components, interpretation and prediction, so they decide to run three different types of analysis:

1. Linear regression
2. Support Vector Regression (SVR)
3. Text analysis

Whenever the project mentions the three analyses henceforth, these are the three meant.
Consider that you have been asked to run the analysis and answer the questions MovieMagic wants answered.
# EDA + general
library(skimr)
library(tidyverse)
library(psych)
library(scales)
library(knitr)
library(kableExtra) # fancy tables
library(rminer)
library(effects)
library(car)
# model training
library(caret)
library(elasticnet)
library(glmnet)
library(ROCR)
library(e1071)#to run svm
# nlp
library(quanteda)
library(seededlda)
library(topicmodels)
library(RTextTools)
library(wordcloud)
library(tm)
# Read in the data
purchases <- read.csv("http://data.mishra.us/files/project_data.csv")
reviews <- read.csv("http://data.mishra.us/files/project_reviews.csv")
This dataset has five categorical variables and four numeric variables, with no missing data in any observation. The majority of movie viewers are married, have a secondary education, used a discount, and did not see a movie alone.
Ages range from 21 to 61, with a slight right skew. On average, a customer sees 1.9 movies, and the median concession spend is $216.
At first glance at a pairs panel, it appears 'experienced' movie-goers, such as those who've seen many movies at the theater or who've been members for a while, are spending less on concessions. 'Experienced' movie-goers may be developing traditions or habits such as going to dinner beforehand, or 'tricks' like bringing in their own food. When looking at just movies seen ~ spending, we find the outliers mostly come from those who are seeing 3 or fewer movies.
Those who went alone appear to spend less, though this may be misleading since we don't know the size of their group. I can only imagine those spending over $10k on concessions are hosting parties or large groups at the theater. It might be worth removing them as outliers, as this group would require a wholly different marketing strategy catering toward coordinating groups.
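To size up that possible 'group host' segment, a quick split of the purchases data might look like the sketch below. The $1,000 cutoff here is an assumption, matching the split used later; tidyverse is loaded above.

```r
# Compare suspected group hosts (> $1,000) against typical spenders
# (threshold is an assumption, not an established business rule)
purchases %>%
  mutate(segment = ifelse(amount_spent > 1000, "high", "typical")) %>%
  group_by(segment) %>%
  summarise(
    n = n(),
    median_spent = median(amount_spent),
    pct_seen_alone = mean(seen_alone == "yes")
  )
```

If the high segment skews toward singles or solo attendees, that would support the group-host hypothesis explored below.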
skim(purchases)
| | |
|---|---|
| Name | purchases |
| Number of rows | 2000 |
| Number of columns | 9 |
| Column type frequency: factor | 5 |
| Column type frequency: numeric | 4 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| job | 0 | 1 | FALSE | 12 | blu: 675, tec: 286, man: 278, adm: 252 |
| marital | 0 | 1 | FALSE | 3 | mar: 1208, sin: 520, div: 272 |
| education | 0 | 1 | FALSE | 4 | sec: 1101, pri: 394, ter: 377, unk: 128 |
| seen_alone | 0 | 1 | FALSE | 2 | no: 1955, yes: 45 |
| discount | 0 | 1 | FALSE | 2 | yes: 1786, no: 214 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| age | 0 | 1 | 40.54 | 9.12 | 21 | 33 | 39 | 47 | 61 | ▂▇▆▅▃ |
| days_member | 0 | 1 | 262.40 | 249.35 | 0 | 124 | 197 | 310 | 2462 | ▇▁▁▁▁ |
| movies_seen | 0 | 1 | 1.88 | 1.16 | 1 | 1 | 2 | 2 | 9 | ▇▂▁▁▁ |
| amount_spent | 0 | 1 | 517.18 | 980.58 | 0 | 66 | 216 | 510 | 10350 | ▇▁▁▁▁ |
pairs.panels(purchases)
hist(log(purchases$amount_spent + 1)) # +1 avoids log(0) for customers who spent nothing
ggplot(purchases, aes(job, amount_spent)) +
geom_boxplot() +
scale_y_continuous(labels=dollar)
ggplot(purchases, aes(education, amount_spent)) +
geom_boxplot() +
scale_y_continuous(labels=dollar)
ggplot(purchases, aes(factor(movies_seen), amount_spent)) +
geom_boxplot() +
scale_y_continuous(labels=dollar)
The linear regression model has an R-squared value of 0.02532, which is far too low to base judgments on which variables are associated with the amount spent. Two variables are significant.
set.seed(123)
model.lm.fullset <- lm(amount_spent ~., purchases)
summary(model.lm.fullset)
##
## Call:
## lm(formula = amount_spent ~ ., data = purchases)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1131.3 -410.4 -269.8 22.3 9575.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 243.68847 180.06384 1.353 0.17610
## age 4.78582 2.75768 1.735 0.08282 .
## jobblue-collar -19.03636 76.06472 -0.250 0.80241
## jobentrepreneur 135.54936 147.53145 0.919 0.35832
## jobmanagement 105.30974 101.24757 1.040 0.29841
## jobretired -130.82587 144.88928 -0.903 0.36667
## jobself-employed -19.70345 157.38334 -0.125 0.90038
## jobservices 0.06220 87.98871 0.001 0.99944
## jobstudent -92.11807 200.99827 -0.458 0.64679
## jobtechnician -14.76209 85.46043 -0.173 0.86288
## jobunemployed 246.57788 154.07094 1.600 0.10967
## jobunknown 33.40031 408.14602 0.082 0.93479
## jobwait-staff 590.69694 223.53039 2.643 0.00829 **
## maritalmarried -43.62729 66.70185 -0.654 0.51315
## maritalsingle -61.88122 78.09725 -0.792 0.42825
## educationsecondary 54.10943 62.70206 0.863 0.38826
## educationtertiary 264.07703 92.07081 2.868 0.00417 **
## educationunknown 63.18623 103.52489 0.610 0.54170
## seen_aloneyes -249.44346 147.38601 -1.692 0.09072 .
## discountyes 10.38529 71.48502 0.145 0.88451
## days_member 0.11692 0.08768 1.334 0.18251
## movies_seen -6.59583 18.92702 -0.348 0.72751
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 973.2 on 1978 degrees of freedom
## Multiple R-squared: 0.02532, Adjusted R-squared: 0.01497
## F-statistic: 2.447 on 21 and 1978 DF, p-value: 0.0002671
# normally distributed residuals?
mean(model.lm.fullset$residuals)
## [1] 3.273445e-14
# homoscedasticity?
plot(model.lm.fullset)
# multicollinearity?
car::vif(model.lm.fullset)
## GVIF Df GVIF^(1/(2*Df))
## age 1.335362 1 1.155578
## job 2.659907 11 1.045471
## marital 1.289658 2 1.065660
## education 2.323358 3 1.150852
## seen_alone 1.008850 1 1.004415
## discount 1.031051 1 1.015407
## days_member 1.008786 1 1.004384
## movies_seen 1.011173 1 1.005571
# autocorrelation
lmtest::dwtest(model.lm.fullset)
##
## Durbin-Watson test
##
## data: model.lm.fullset
## DW = 1.9615, p-value = 0.1933
## alternative hypothesis: true autocorrelation is greater than 0
I have a hypothesis that single individuals are hosting groups and spending far more than normal. If true, these two groups would require different marketing strategies. To test this, I predict large spending (over $1,000) separately from small spending (under $100, roughly what a family of 5 could spend on dinner and drinks at a theater). Separating out these two groups results in two models, both with higher R-squared values, though neither is high enough to comfortably base heavy decisions on.
I also look at a log transform of amount_spent since it's highly right-skewed and bounded at zero.
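A quick skewness check, using the skewness() helper from the already-loaded e1071 package, supports the transform:

```r
# amount_spent is heavily right-skewed; log1p pulls it toward symmetry
e1071::skewness(purchases$amount_spent)
e1071::skewness(log1p(purchases$amount_spent))
```

A large positive value for the raw variable and a value much closer to zero after the transform would confirm what the histogram suggests.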
purchases_eng <- purchases %>%
mutate(
isMember = factor(ifelse(days_member ==0, 0,1)),
movies_seen = factor(movies_seen)
)
purchases_high <- purchases %>%
filter(amount_spent>1000)
purchases_low <- purchases %>%
filter(amount_spent<100)
## I was going to look into predicting if someone spent at all, but will reserve for another analysis.
# purchases_zero <- purchases_eng %>%
# mutate(
# didPurchase = factor(ifelse(amount_spent==0, 0,1)),
# ) %>%
# select(-amount_spent)
model_low <- train(amount_spent ~., purchases_low, method='lm')
summary(model_low)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -49.716 -27.485 -8.225 26.906 77.980
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 13.527269 10.075597 1.343 0.179894
## age 0.507322 0.155954 3.253 0.001203 **
## `jobblue-collar` -3.935443 4.452023 -0.884 0.377052
## jobentrepreneur 3.201837 9.476769 0.338 0.735580
## jobmanagement -9.953818 6.412432 -1.552 0.121102
## jobretired -10.657489 8.151050 -1.307 0.191522
## `jobself-employed` -13.511950 8.981577 -1.504 0.132979
## jobservices -0.074223 5.129879 -0.014 0.988461
## jobstudent 5.303376 11.038702 0.480 0.631086
## jobtechnician -7.513190 4.982852 -1.508 0.132106
## jobunemployed -17.157006 10.054110 -1.706 0.088416 .
## jobunknown -8.064862 16.893099 -0.477 0.633239
## `jobwait-staff` -15.811973 14.914403 -1.060 0.289470
## maritalmarried -1.856065 3.861842 -0.481 0.630957
## maritalsingle 2.497457 4.528032 0.552 0.581450
## educationsecondary -6.935446 3.536126 -1.961 0.050284 .
## educationtertiary -5.795294 5.578084 -1.039 0.299233
## educationunknown -8.871741 5.612771 -1.581 0.114465
## seen_aloneyes -3.921053 6.342004 -0.618 0.536623
## discountyes 11.865644 3.548561 3.344 0.000876 ***
## days_member -0.004277 0.005832 -0.733 0.463652
## movies_seen 1.815173 1.048710 1.731 0.083967 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 31.94 on 628 degrees of freedom
## Multiple R-squared: 0.06488, Adjusted R-squared: 0.03361
## F-statistic: 2.075 on 21 and 628 DF, p-value: 0.003374
model_high <- train(amount_spent ~., purchases_high, method='lm')
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
summary(model_high)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2801.8 -935.7 -462.6 354.4 7122.0
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4449.8649 876.7227 5.076 7.94e-07 ***
## age -20.5677 13.0165 -1.580 0.1154
## `jobblue-collar` -138.1964 424.6701 -0.325 0.7452
## jobentrepreneur 536.7240 709.3356 0.757 0.4500
## jobmanagement 17.9080 476.0889 0.038 0.9700
## jobretired 1789.1434 1077.9855 1.660 0.0983 .
## `jobself-employed` -37.8031 870.1437 -0.043 0.9654
## jobservices -42.7880 467.9268 -0.091 0.9272
## jobstudent -27.6073 1778.1471 -0.016 0.9876
## jobtechnician -174.7023 447.6008 -0.390 0.6967
## jobunemployed 753.3475 757.3892 0.995 0.3209
## jobunknown 111.4047 1805.5523 0.062 0.9509
## `jobwait-staff` 1907.5553 949.1847 2.010 0.0456 *
## maritalmarried -758.8439 347.2423 -2.185 0.0299 *
## maritalsingle -815.0512 408.8288 -1.994 0.0474 *
## educationsecondary -224.4673 367.7496 -0.610 0.5422
## educationtertiary 230.9600 445.8317 0.518 0.6049
## educationunknown -256.8584 509.7955 -0.504 0.6148
## seen_aloneyes 252.2936 1298.9299 0.194 0.8462
## discountyes -584.2062 373.6642 -1.563 0.1193
## days_member 0.2716 0.4145 0.655 0.5130
## movies_seen 0.6603 93.2239 0.007 0.9944
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1727 on 231 degrees of freedom
## Multiple R-squared: 0.1053, Adjusted R-squared: 0.02392
## F-statistic: 1.294 on 21 and 231 DF, p-value: 0.1799
model_log <- train(log(amount_spent+1) ~., purchases, method='lm')
summary(model_log)
##
## Call:
## lm(formula = .outcome ~ ., data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.6979 -0.7966 0.3095 1.2386 4.9356
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.9969500 0.3647111 8.217 3.72e-16 ***
## age 0.0268490 0.0055856 4.807 1.65e-06 ***
## `jobblue-collar` -0.1080271 0.1540656 -0.701 0.483276
## jobentrepreneur 0.1480491 0.2988182 0.495 0.620338
## jobmanagement -0.1491229 0.2050723 -0.727 0.467206
## jobretired -0.6779986 0.2934667 -2.310 0.020974 *
## `jobself-employed` -0.3279678 0.3187728 -1.029 0.303678
## jobservices -0.0406757 0.1782171 -0.228 0.819486
## jobstudent 0.0058285 0.4071129 0.014 0.988579
## jobtechnician -0.2143712 0.1730962 -1.238 0.215696
## jobunemployed 0.1853215 0.3120637 0.594 0.552675
## jobunknown -0.4088848 0.8266812 -0.495 0.620930
## `jobwait-staff` 0.2097409 0.4527506 0.463 0.643229
## maritalmarried 0.0604497 0.1351016 0.447 0.654607
## maritalsingle 0.1085608 0.1581824 0.686 0.492604
## educationsecondary 0.0851712 0.1270002 0.671 0.502529
## educationtertiary 0.5189366 0.1864852 2.783 0.005442 **
## educationunknown -0.0870435 0.2096850 -0.415 0.678102
## seen_aloneyes -1.0014765 0.2985237 -3.355 0.000809 ***
## discountyes 0.8722627 0.1447896 6.024 2.02e-09 ***
## days_member 0.0002015 0.0001776 1.135 0.256658
## movies_seen -0.0030210 0.0383358 -0.079 0.937196
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.971 on 1978 degrees of freedom
## Multiple R-squared: 0.04423, Adjusted R-squared: 0.03408
## F-statistic: 4.358 on 21 and 1978 DF, p-value: 1.679e-10
Splitting into train and test sets in preparation for comparing prediction models.
one.hot <- as.data.frame(model.matrix(~. -1, purchases))
set.seed(345)
trainIndex <- sample(nrow(one.hot), (nrow(one.hot)*.7))
train <- one.hot[trainIndex,]
test <- one.hot[-trainIndex,]
test_outcome <- test[,"amount_spent"]
train_outcome <- train[,"amount_spent"]
test <- test %>% select(-amount_spent)
A short function to quickly compare test and train RMSE values across multiple models.
compare_models <- function(test, train, expected, model) {
  # score both sets with the fitted model
  predict_test <- predict(model, test)
  predict_train <- predict(model, train)
  # stack train and test error metrics into one table
  stats <- as.matrix(rbind(
    mmetric(train$amount_spent, predict_train, c("MAE", "MSE", "RMSE", "RAE")),
    mmetric(expected, predict_test, c("MAE", "MSE", "RMSE", "RAE"))
  ))
  rownames(stats) <- c("Train Set", "Test Set")
  knitr::kable(stats, digits = 2, caption = deparse(substitute(model))) %>%
    kable_styling(bootstrap_options = c("hover"))
}
Taking a few tuning rounds using the e1071 grid-search tuning method.
set.seed(123)
model.svm.radial <- svm(amount_spent ~ ., data = train, kernel = 'radial', cost = 10, scale = FALSE)
compare_models(test, train, test_outcome, model.svm.radial)
| | MAE | MSE | RMSE | RAE |
|---|---|---|---|---|
| Train Set | 421.22 | 889747.2 | 943.26 | 82.18 |
| Test Set | 461.87 | 1399393.4 | 1182.96 | 81.16 |
set.seed(123)
# perform grid search
tuneResult <- tune(svm, amount_spent~., data= train,
ranges = list(epsilon = seq(0,1,0.1), cost = 2^(2:9))
)
print(tuneResult)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## epsilon cost
## 0.5 4
##
## - best performance: 838189.6
# Draw the tuning graph
plot(tuneResult)
set.seed(123)
tuneResult2 <- tune(svm, amount_spent~., data= train,
ranges = list(epsilon = seq(.42,.55,0.01), cost = seq(2,8,1))
)
print(tuneResult2)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## epsilon cost
## 0.47 2
##
## - best performance: 822950.6
plot(tuneResult2)
set.seed(123)
tuneResult3 <- tune(svm, amount_spent~., data= train,
ranges = list(epsilon = seq(.47,.49,0.005), cost = seq(.1,3.1,.5))
)
print(tuneResult3)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## epsilon cost
## 0.48 0.6
##
## - best performance: 816483.1
plot(tuneResult3)
set.seed(123)
tuneResult4 <- tune(svm, amount_spent~., data= train,
ranges = list(epsilon = seq(.485,.51,0.005), cost = seq(.05,.15,.1))
)
print(tuneResult4)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## epsilon cost
## 0.505 0.15
##
## - best performance: 815765.7
plot(tuneResult4)
set.seed(500)
tuneResult5 <- tune(svm, amount_spent~., data= train,
ranges = list(epsilon = seq(.5,.52,0.002), cost = seq(.14,.2,.01))
)
print(tuneResult5)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## epsilon cost
## 0.516 0.15
##
## - best performance: 813626.9
plot(tuneResult5)
model.svm.bestTune <- tuneResult5$best.model
compare_models(test, train, test_outcome, model.svm.bestTune)
| | MAE | MSE | RMSE | RAE |
|---|---|---|---|---|
| Train Set | 501.44 | 794100.7 | 891.12 | 97.83 |
| Test Set | 554.21 | 1297557.1 | 1139.10 | 97.38 |
# # perform a grid search
# set.seed(123)
# tuneResult.log <- tune(svm, log(amount_spent+1)~., data= train,
# ranges = list(epsilon = seq(0,1,0.1), cost = 2^(2:9))
# )
# print(tuneResult.log)
# # Draw the tuning graph
# plot(tuneResult.log)
#
# set.seed(123)
# tuneResult2.log <- tune(svm, log(amount_spent+1)~., data= train,
# ranges = list(epsilon = seq(.42,.55,0.01), cost = seq(2,8,1))
# )
# print(tuneResult2.log)
# plot(tuneResult2.log)
#
# set.seed(123)
# tuneResult3.log <- tune(svm, log(amount_spent+1)~., data= train,
# ranges = list(epsilon = seq(.48,.53,0.01), cost = seq(.1,3.1,.5))
# )
# print(tuneResult3.log)
# plot(tuneResult3.log)
#
# set.seed(123)
# tuneResult4.log <- tune(svm, log(amount_spent+1)~., data= train,
# ranges = list(epsilon = seq(.495,.51,0.005), cost = seq(.05,.15,.1))
# )
# print(tuneResult4.log)
# plot(tuneResult4.log)
#
# set.seed(123)
# tuneResult5.log <- tune(svm, log(amount_spent+1)~., data= train,
# ranges = list(epsilon = seq(.505,.52,0.002), cost = seq(.01,.06,.01))
# )
# print(tuneResult5.log)
# plot(tuneResult5.log)
set.seed(123)
tuneResult6.log <- tune(svm, log(amount_spent+1)~., data= train,
ranges = list(epsilon = seq(.528,.552,0.003), cost = seq(.01,.03,.01))
)
print(tuneResult6.log)
##
## Parameter tuning of 'svm':
##
## - sampling method: 10-fold cross validation
##
## - best parameters:
## epsilon cost
## 0.534 0.02
##
## - best performance: 4.006164
plot(tuneResult6.log)
model.svm.log.bestTune <- tuneResult6.log$best.model
compare_models(test, train, test_outcome, model.svm.log.bestTune)
| | MAE | MSE | RMSE | RAE |
|---|---|---|---|---|
| Train Set | 506.75 | 1070716 | 1034.75 | 98.86 |
| Test Set | 526.88 | 1578593 | 1256.42 | 92.58 |
## since heavily positive-skewed and bounded at 0, trying log of amount spent
set.seed(123)
model.log <- train(log(amount_spent+1) ~., train, method='lm')
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
set.seed(123)
model.svm.tuned <- svm(amount_spent ~., data=train, epsilon=.51, cost=.16)
set.seed(123)
model.svm.log.tuned <- svm(log(amount_spent+1) ~., data=train, epsilon=.549, cost=.01)
set.seed(123)
model.svm.simple <- svm(amount_spent~., data = train)
set.seed(123)
model.lm <- train(amount_spent ~., train, method='lm')
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
compare_models(test, train, test_outcome, model.svm.simple)
| | MAE | MSE | RMSE | RAE |
|---|---|---|---|---|
| Train Set | 391.97 | 801454.8 | 895.24 | 76.47 |
| Test Set | 476.60 | 1370485.6 | 1170.68 | 83.74 |
compare_models(test, train, test_outcome, model.svm.tuned)
| | MAE | MSE | RMSE | RAE |
|---|---|---|---|---|
| Train Set | 498.94 | 793204.5 | 890.62 | 97.34 |
| Test Set | 551.77 | 1297742.5 | 1139.19 | 96.95 |
compare_models(test, train, test_outcome, model.svm.log.tuned)
| | MAE | MSE | RMSE | RAE |
|---|---|---|---|---|
| Train Set | 506.77 | 1070752 | 1034.77 | 98.87 |
| Test Set | 526.90 | 1578623 | 1256.43 | 92.58 |
compare_models(test, train, test_outcome, model.lm)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
| | MAE | MSE | RMSE | RAE |
|---|---|---|---|---|
| Train Set | 506.60 | 793190.6 | 890.61 | 98.83 |
| Test Set | 549.55 | 1284606.8 | 1133.40 | 96.56 |
compare_models(test, train, test_outcome, model.log)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
| | MAE | MSE | RMSE | RAE |
|---|---|---|---|---|
| Train Set | 506.92 | 1070943 | 1034.86 | 98.90 |
| Test Set | 527.05 | 1578788 | 1256.50 | 92.61 |
## check for null review text
which(!complete.cases(reviews$text))
## integer(0)
reviews <- reviews %>%
mutate(
text = as.character(text),
valence = factor(ifelse(star < 3, "Negative", "Positive")) # 1-2 stars coded Negative
)
summary(reviews)
## star text valence
## Min. :1.00 Length:75 Negative: 7
## 1st Qu.:3.00 Class :character Positive:68
## Median :4.00 Mode :character
## Mean :4.04
## 3rd Qu.:5.00
## Max. :5.00
ggplot(reviews, aes(str_length(text), fill=factor(star))) +
geom_boxplot() +
ggtitle("review length ~ star rating")
## create corpus and include star rating for segmenting
reviews.corpus <- corpus(reviews$text)
docvars(reviews.corpus, "star") <- reviews$star
docvars(reviews.corpus, "valence") <- reviews$valence
reviews.dfm <- dfm(reviews.corpus,
remove=stopwords('english'),
remove_punct=TRUE,
remove_symbols=TRUE,
remove_separators=TRUE
)
reviews.dfm.valence <- dfm(reviews.corpus,
remove=stopwords('english'),
remove_punct=TRUE,
remove_symbols=TRUE,
remove_separators=TRUE,
groups = 'valence'
)
reviews.dfm.star <- dfm(reviews.corpus,
remove=stopwords('english'),
remove_punct=TRUE,
remove_symbols=TRUE,
remove_separators=TRUE,
groups = 'star'
)
reviews.tfidf <- dfm_tfidf(reviews.dfm)
set.seed(100)
textplot_wordcloud(reviews.dfm,
min_count=3,
random_order = FALSE,
rotation=.25,
color = RColorBrewer::brewer.pal(8,"Dark2")
)
textplot_wordcloud(reviews.dfm.valence,
comparison=TRUE,
min_count=3
)
textplot_wordcloud(reviews.dfm.star,
comparison=TRUE,
min_count=3
)
reviews.keyness <- textstat_keyness(reviews.dfm, target=reviews$valence=="Positive")
textplot_keyness(reviews.keyness, margin=.1, n=13)
I used quanteda and seededlda to practice a different LDA implementation. A knitting error prevents me from including the code; the output in RStudio is:
| topic1 | topic2 | topic3 |
|---|---|---|
| movie | time | food |
| great | back | just |
| place | moviemagic | go |
| food | cinema | like |
| fun | movies | popcorn |
| good | also | get |
| love | really | beer |
| can | going | ordered |
| pizza | good | really |
| theater | awesome | got |
model_lda <- textmodel_lda(reviews.dfm, k=3)
# as.data.frame(terms(model_lda, 10))
# perform a Latent Dirichlet Analysis (several lines of code to get you started)
# first remove stop words
corpus <- VCorpus(VectorSource(reviews$text))
# a function to clean /,@,\\,|
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
corpus <- tm_map(corpus, toSpace, "/|@|\\|")
corpus<- tm_map(corpus, stripWhitespace) # remove white space
# covert all to lower case else same word as lower and uppercase will classified as different
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers) # remove numbers
corpus <- tm_map(corpus, removePunctuation) # remove punctuations
corpus <- tm_map(corpus, removeWords, stopwords("en"))
dtm <- DocumentTermMatrix(corpus)
set.seed(234)
rowTotals <- apply(dtm , 1, sum)
dtm <- dtm[rowTotals> 0, ]
lda <- LDA(dtm, k = 3, method = "Gibbs", control = NULL)
topics <- tidytext::tidy(lda, matrix = "beta") # beta is the topic-word density
## Warning: `tbl_df()` is deprecated as of dplyr 1.0.0.
## Please use `tibble::as_tibble()` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_warnings()` to see where this warning was generated.
top_terms <- topics %>%
group_by(topic) %>%
top_n(10, beta) %>% # top_n picks 10 topics.
ungroup() %>%
arrange(topic, -beta)
top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip()
# perplexity calculation: try different values of k
lda.def <- LDA(dtm, k = 4, control = NULL)
perplexity(lda.def)
## [1] 367.3747
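Rather than re-running with a new k by hand, the perplexity check can be wrapped in a loop. This is a sketch reusing the dtm built above; lower perplexity suggests a better fit, though it tends to keep falling as k grows, so it is a guide rather than a rule.

```r
# Compare perplexity across candidate topic counts (default VEM fit, as in lda.def)
set.seed(234)
perp_by_k <- sapply(2:6, function(k) perplexity(LDA(dtm, k = k, control = NULL)))
names(perp_by_k) <- paste0("k=", 2:6)
perp_by_k
```

An elbow in these values, rather than the minimum, is often the more defensible choice of k.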
summary(model.lm.fullset)
##
## Call:
## lm(formula = amount_spent ~ ., data = purchases)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1131.3 -410.4 -269.8 22.3 9575.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 243.68847 180.06384 1.353 0.17610
## age 4.78582 2.75768 1.735 0.08282 .
## jobblue-collar -19.03636 76.06472 -0.250 0.80241
## jobentrepreneur 135.54936 147.53145 0.919 0.35832
## jobmanagement 105.30974 101.24757 1.040 0.29841
## jobretired -130.82587 144.88928 -0.903 0.36667
## jobself-employed -19.70345 157.38334 -0.125 0.90038
## jobservices 0.06220 87.98871 0.001 0.99944
## jobstudent -92.11807 200.99827 -0.458 0.64679
## jobtechnician -14.76209 85.46043 -0.173 0.86288
## jobunemployed 246.57788 154.07094 1.600 0.10967
## jobunknown 33.40031 408.14602 0.082 0.93479
## jobwait-staff 590.69694 223.53039 2.643 0.00829 **
## maritalmarried -43.62729 66.70185 -0.654 0.51315
## maritalsingle -61.88122 78.09725 -0.792 0.42825
## educationsecondary 54.10943 62.70206 0.863 0.38826
## educationtertiary 264.07703 92.07081 2.868 0.00417 **
## educationunknown 63.18623 103.52489 0.610 0.54170
## seen_aloneyes -249.44346 147.38601 -1.692 0.09072 .
## discountyes 10.38529 71.48502 0.145 0.88451
## days_member 0.11692 0.08768 1.334 0.18251
## movies_seen -6.59583 18.92702 -0.348 0.72751
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 973.2 on 1978 degrees of freedom
## Multiple R-squared: 0.02532, Adjusted R-squared: 0.01497
## F-statistic: 2.447 on 21 and 1978 DF, p-value: 0.0002671
Of the 8 predictors, which predictors have a significant influence on amount spent on concessions? Running a simple linear regression with all the original variables and no pre-processing yields a model in which only 2 coefficients are significantly associated with the amount spent on concessions (p < .01): jobwait-staff and educationtertiary.
In my opinion, these results should not be acted upon, as the model explains only about 2.5% of the variability in the amount someone spends on concessions (Multiple R-squared = 0.025). I used model.lm.fullset to answer this question.
Which predictors have a positive influence and which predictors have a negative influence on the amount spent on concessions? Which analysis, regression or SVR, helped you answer this question?
These variables have a positive coefficient, i.e. a positive influence on the amount spent on concessions: age, jobentrepreneur, jobmanagement, jobservices, jobunemployed, jobunknown, jobwait-staff, educationsecondary, educationtertiary, educationunknown, discountyes, and days_member.
These variables have a negative coefficient, i.e. a negative influence on the amount spent on concessions: jobblue-collar, jobretired, jobself-employed, jobstudent, jobtechnician, maritalmarried, maritalsingle, seen_aloneyes, and movies_seen.
Since the intercept is positive, holding all variables at 0 (the default state) has a positive influence. Because the categorical variables were one-hot encoded, that default state corresponds to the reference levels dropped from the coefficient table: the job, marital, and education categories not listed above, plus seen_alone = no and discount = no.
I used the white-box linear regression model to answer this question; SVR is a black-box model and does not expose coefficient values.
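That said, SVR inputs can still be ranked via sensitivity analysis even without coefficients. A sketch using rminer (loaded in the setup chunk), which fits its own SVM and measures one-dimensional sensitivity; `train` is assumed to be the training split used for the SVR models, and the helper name is mine:

```r
# approximate predictor influence for a black-box SVM via 1-D sensitivity analysis
svr_importance <- function(train) {
  m <- fit(amount_spent ~ ., data = train, model = "svm") # rminer::fit
  Importance(m, data = train, method = "1D-SA")$imp       # one value per column of train
}
```

This yields relative importance, not signed effects, so it complements rather than replaces the regression coefficients.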
Given the significant predictors, what strategies can MovieMagic come up with to increase amount spent on concessions?
compare_models(test, train, test_outcome, model.svm.simple)
|  | MAE | MSE | RMSE | RAE |
|---|---|---|---|---|
| Train Set | 391.97 | 801454.8 | 895.24 | 76.47 |
| Test Set | 476.60 | 1370485.6 | 1170.68 | 83.74 |
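compare_models is a helper defined earlier in the project and not shown in this excerpt. A minimal reconstruction of what it computes, built on rminer::mmetric (rminer is loaded in the setup chunk), could look like this; the argument order mirrors the calls in this section, but the exact implementation used to produce these tables may differ:

```r
# report MAE/MSE/RMSE/RAE on the train and test splits for one fitted model
compare_models <- function(test, train, test_outcome, model) {
  metrics <- c("MAE", "MSE", "RMSE", "RAE")
  rbind(
    `Train Set` = mmetric(train$amount_spent, predict(model, train), metrics),
    `Test Set`  = mmetric(test_outcome, predict(model, test), metrics)
  )
}
```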
compare_models(test, train, test_outcome, model.svm.tuned)
|  | MAE | MSE | RMSE | RAE |
|---|---|---|---|---|
| Train Set | 498.94 | 793204.5 | 890.62 | 97.34 |
| Test Set | 551.77 | 1297742.5 | 1139.19 | 96.95 |
compare_models(test, train, test_outcome, model.svm.log.tuned)
|  | MAE | MSE | RMSE | RAE |
|---|---|---|---|---|
| Train Set | 506.77 | 1070752 | 1034.77 | 98.87 |
| Test Set | 526.90 | 1578623 | 1256.43 | 92.58 |
compare_models(test, train, test_outcome, model.lm)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
|  | MAE | MSE | RMSE | RAE |
|---|---|---|---|---|
| Train Set | 506.60 | 793190.6 | 890.61 | 98.83 |
| Test Set | 549.55 | 1284606.8 | 1133.40 | 96.56 |
compare_models(test, train, test_outcome, model.log)
## Warning in predict.lm(modelFit, newdata): prediction from a rank-deficient fit
## may be misleading
|  | MAE | MSE | RMSE | RAE |
|---|---|---|---|---|
| Train Set | 506.92 | 1070943 | 1034.86 | 98.90 |
| Test Set | 527.05 | 1578788 | 1256.50 | 92.61 |
Which analysis, linear regression or SVR, provides better prediction? Which metric would you focus on to support your answer?
I looked at both MAE and RMSE when comparing models. The simple SVR model, model.svm.simple, which used the default epsilon and cost, had the lowest MAE of all models: $476.60. The linear regression, model.lm, however, had the lowest RMSE: $1133.40. This tells me model.svm.simple had higher error variance; when it was wrong, it tended to be wrong by more than model.lm was. Which metric (and hence which model) to prefer depends on the application of the prediction and how undesirable large individual errors are.
While not shown here, RMSLE might be a better metric for comparing these models, since it accounts for the scale of the amount spent: predicting someone will spend $50 when they spent $5 is a different situation than predicting someone would spend $3,050 when they actually spent $3,000. RMSLE may be more appropriate when evaluating across the whole range of spending.
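That comparison is easy to make concrete with a small base-R helper (log1p handles zero spending; both inputs must be non-negative):

```r
# root mean squared logarithmic error: penalizes the ratio, not the absolute gap
rmsle <- function(actual, predicted) {
  sqrt(mean((log1p(predicted) - log1p(actual))^2))
}
rmsle(5, 50)       # ~ 2.14  (off by 10x: heavily penalized)
rmsle(3000, 3050)  # ~ 0.017 (off by $50 on $3000: barely penalized)
```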
textplot_wordcloud(reviews.dfm.valence,
comparison=TRUE,
min_count=3
)
textplot_wordcloud(reviews.dfm.star,
comparison=TRUE,
min_count=3
)
textplot_keyness(reviews.keyness, margin=.1, n=13)
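The reviews.keyness object plotted above was built earlier in the project. A sketch of how such an object is typically produced (in quanteda >= 3, textstat_keyness lives in the companion quanteda.textstats package; the docvar name high_star and the helper name are assumptions of mine):

```r
# keyness contrast: TRUE documents (3-5 star reviews) vs. the rest (1-2 star)
star_keyness <- function(dfm_star) {
  textstat_keyness(dfm_star, target = docvars(dfm_star, "high_star"))
}
```

Positive keyness scores then mark words over-represented in high-star reviews, and negative scores words over-represented in 1-2 star reviews.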
MovieMagic wants to visualize the reviews through a wordcloud and wants to find out which words are used most commonly in the reviews that customers write for MovieMagic. Create 2 wordclouds - one for reviews that received 3, 4, or 5 star ratings and another with reviews that received 1 or 2 stars ratings. Knowing the prominent words in each of the wordclouds, what strategies can be developed in messaging customers? Would the strategies differ?
It appears waiting for food may be one factor in a customer’s review of the theater. I would suggest measuring the duration between order submission and food delivery; this might help determine how to reduce poor reviews around concession purchases. Such an analysis might examine whether certain items take longer to prepare, potentially causing people to miss their movie. Other factors might also influence these poor experiences, such as the number of employees working concessions at the time, the time of day, the number of movies starting within 15 minutes, or the availability of certain items.
Popcorn was used in positive reviews. You might explore whether a separate popcorn-only line would increase these particular sales. I would suggest checking whether sales of other items decrease when the popcorn-only line is available: while it may gain sales from customers who would otherwise be deterred by long lines, it may also reduce sales of items that would have been purchased alongside the popcorn.
### from quanteda package
# as.data.frame(terms(model_lda, 10))
### from topic models pkg
top_terms %>%
mutate(term = reorder(term, beta)) %>%
ggplot(aes(term, beta, fill = factor(topic))) +
geom_col(show.legend = FALSE) +
facet_wrap(~ topic, scales = "free") +
coord_flip()
Additional SeededLDA topic output:
| topic1 | topic2 | topic3 |
|---|---|---|
| movie | time | food |
| great | back | just |
| place | moviemagic | go |
| food | cinema | like |
| fun | movies | popcorn |
| good | also | get |
| love | really | beer |
| can | going | ordered |
| pizza | good | really |
| theater | awesome | got |
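A sketch of how this output can be produced (assumes reviews.dfm is the review dfm built earlier; textmodel_lda is the unseeded LDA in the seededlda package, loaded in the setup chunk; the helper name is mine):

```r
# fit a 3-topic LDA on the review dfm and list the top words per topic
review_topics <- function(dfm, k = 3, seed = 123) {
  set.seed(seed)             # Gibbs sampling; fix the seed for reproducibility
  model_lda <- textmodel_lda(dfm, k = k)
  terms(model_lda, 10)       # 10 highest-probability words per topic, as tabled above
}
```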
MovieMagic also wants to use topic modeling to find out whether the content in the reviews could be categorized into specific topics. If you used LDA to create 3 topic groups (k = 3) what would be the titles you would assign to these 3 topics. MovieMagic wants you to use the words within the topic to infer topic title. Given the topics you inferred what strategies would you suggest are possible for MovieMagic if it wants to increase concession sales. Would you recommend promotions or advertising or loyalty program; justify your choice of strategy?
Topic 1 - Classic Date Night
Need to get back to basics with your beau? We have a full menu for you and yours to indulge in while snuggled up in front of the big screen. Suggestions: When the alternative for date night includes finding and paying for both dinner and the movie, then rushing through traffic to arrive on time, make it an easy choice for couples to dine at the theater instead. This might include offering higher-end meals paired with wines, or a quieter, more upscale seating area where kids are not allowed, for couples to reconnect before the movie. Everyone loves a good wine pairing.
Topic 2 - The Magic of Movies -or- Wholesome Family Tradition
Grab some popcorn and snacks and enjoy a fun flick with the whole family. Suggestions: Make it easy for parents bringing a kid or two, or even their 4 or 5 friends! Put together ‘Family Bundles’ with the items most frequently co-purchased by families, or offer easy low-choice selections: each kid chooses 1 snack and 1 drink from a list. Take the stress out of coordinating food selections.
Topic 3 - Kick Back
A powerful sound system, a crisp high-definition picture, and comfortable recliners for enjoying some cold beer and hot pizza with friends - this sounds like a place to relieve some stress. If you haven’t yet, look to expand your beer offerings! If your selection is better than competitors’, you may be able to charge more simply for having it available. Consider offering a discount when beer is purchased with food, especially in bulk - a free 6th beer when bought with 2 pizzas, for example.